Scalable, Parameter- and Memory-Efficient Pretraining for Large Language Models

Recent Algorithmic Advances and Comprehensive Benchmarking

Published: May 28, 2025

Authors: A. Glentis et al.
Link: http://arxiv.org/abs/2505.22922v1
Institutions: University of Minnesota • Peking University • University of Sydney
Keywords: large language models, parameter-efficient pre-training, memory-efficient optimization, low-rank factorization, weight refactorization, momentum reset, GaLore, Fira, SLTrain, LLaMA, LoRA, C4 dataset, scaling laws, AdamW, model compression, benchmarking


The exponential growth in the scale of large language models (LLMs), now reaching trillions of parameters, poses significant computational and memory challenges during both pre-training and fine-tuning. Parameter-efficient fine-tuning techniques such as LoRA have proven successful on downstream tasks, but applying such efficiency methods directly to LLM pre-training remains difficult because of the much larger model scale and data requirements involved.
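To make the low-rank idea behind LoRA-style methods concrete, here is a minimal numpy sketch. It is illustrative only: the dimensions, initialization, and variable names are assumptions for exposition, not details taken from the paper. The point is that a rank-r update B @ A to a frozen weight matrix W trains r * (d_in + d_out) parameters instead of d_out * d_in.

```python
import numpy as np

# Hypothetical illustration of a LoRA-style low-rank update.
# Shapes and rank are chosen for exposition, not from the paper.
d_out, d_in, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # trainable, zero-initialized so
                                            # the update starts as a no-op

def forward(x):
    # Effective weight is W + B @ A, applied without materializing the sum:
    # two skinny matmuls instead of forming a d_out x d_in update matrix.
    return W @ x + B @ (A @ x)

full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params} vs full: {full_params} "
      f"({lora_params / full_params:.2%})")
```

With these (assumed) dimensions, the trainable update is about 1.6% the size of the full matrix; memory-efficient pre-training methods such as GaLore and SLTrain pursue similar savings for the optimizer state and the weights themselves.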

To address these issues, the authors conducted an in-depth examination of current strategies and proposed new practical improvements:

Building on these approaches, their benchmarking uncovered several notable findings:

These results lead to several important conclusions and directions for future work: